Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Variant Discovery ◾ 113

unique, and stable, unlike descriptive names, which can be used differently by different

people. For example, NCBI dbSNP assigns an ID with an “rs” prefix to accepted human

variants with asserted positions mapped to a reference sequence, called reference variants

(RefSNP), and an ID with an “ss” prefix to a variant submitted with flanking

sequence. Figure 4.1 shows two dbSNP IDs for reference variants in the VCF ID columns.

Variants may have identifiers from multiple databases. You will see these different types of

identifiers used throughout the literature and in other databases. Different types of

identifiers are used for short variants and structural variants.
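
The prefix convention described above can be checked mechanically. The following is a minimal Python sketch, assuming that the VCF ID column holds either a "." for a missing value or a semicolon-separated list of identifiers. The function names and the "ss" example ID are hypothetical and serve only as an illustration.

```python
import re

# dbSNP-style identifiers: "rs" (RefSNP) or "ss" (submitted), then digits.
_DBSNP_ID = re.compile(r"^(rs|ss)(\d+)$")


def classify_dbsnp_id(identifier):
    """Return 'RefSNP', 'submitted', or 'other' for one identifier."""
    match = _DBSNP_ID.match(identifier)
    if match is None:
        return "other"  # e.g., an identifier from another database
    return "RefSNP" if match.group(1) == "rs" else "submitted"


def classify_vcf_id_column(id_field):
    """Classify every identifier found in a VCF ID column value."""
    if id_field == ".":  # "." marks a missing value in VCF
        return {}
    return {name: classify_dbsnp_id(name) for name in id_field.split(";")}


# "ss12345" is a made-up example ID, not a real dbSNP record.
print(classify_vcf_id_column("rs6054257;ss12345"))
```

Identifiers that do not follow the dbSNP pattern are reported as "other" rather than rejected, since variants may carry identifiers from multiple databases.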

4.1.2. Variant Calling and Analysis

Variant calling is the process of identifying variants in sequence data. The

sequence data are usually stored in FASTQ files obtained from whole-genome, whole-exome,

or targeted gene sequencing. The reads in the FASTQ files are assessed

for quality and then preprocessed to ensure that the final reads are of high quality. The reads

are then aligned to a reference genome, and the read alignment information is stored in

BAM files. The BAM files are then used as input to variant calling programs for

variant identification and analysis. The identified variants are written to a VCF file. A single

VCF file can hold thousands of variants and the genotypes of multiple samples. Genetic

studies usually focus on germline variant calling, where the reference genome used for

mapping the reads is the standard for the species of interest; this allows us to identify

genotypes. Somatic variant calling is used to study diseases like cancer. In somatic variant

calling, the reference is a related tissue from the same individual (e.g., healthy tissue in the

case of cancer). Here, we expect to see genetic mosaicism between cells, or the presence of more

than one genetic line, as a result of genetic mutations.
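
The FASTQ-to-VCF sequence described above can be sketched as a list of shell commands. This is only an illustration under stated assumptions: the choice of FastQC, BWA, SAMtools, and BCFtools is ours and not prescribed by the text, the file names are hypothetical, the reference is assumed to be already indexed for BWA, and the read preprocessing step is omitted. The Python function below only builds the command strings; it does not run them.

```python
from shlex import quote


def germline_calling_commands(ref, fq1, fq2, sample):
    """Return shell commands for a FASTQ -> BAM -> VCF sketch (not executed).

    Tool choice (FastQC, BWA, SAMtools, BCFtools) is an assumption made
    for illustration; other programs fill the same roles.
    """
    ref, fq1, fq2 = quote(ref), quote(fq1), quote(fq2)
    bam = quote(f"{sample}.sorted.bam")
    vcf = quote(f"{sample}.vcf.gz")
    return [
        # 1. Assess the quality of the raw reads.
        f"fastqc {fq1} {fq2}",
        # 2. Align reads to the reference and write a coordinate-sorted BAM.
        f"bwa mem {ref} {fq1} {fq2} | samtools sort -o {bam} -",
        # 3. Index the BAM file.
        f"samtools index {bam}",
        # 4. Call variants from the BAM and write a compressed VCF.
        f"bcftools mpileup -Ou -f {ref} {bam} | bcftools call -mv -Oz -o {vcf}",
    ]


# Hypothetical file names, used only to show the generated commands.
for command in germline_calling_commands(
    "ref.fa", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1"
):
    print(command)
```

Building the commands as strings keeps the order of the steps visible and lets them be inspected before anything is executed.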

A variant calling workflow begins with raw sequencing data for multiple samples or

individuals and ends with a single VCF file containing only the genomic positions where at

least one individual in the population has a variant due to mutations. After variants have

been called, they can be analyzed in different ways. For example, we may wish to

determine which genes are affected by the variants, what consequences the variants have on those genes, and

which phenotypes are associated with them. Thus, called variants can be

annotated with their consequences and associated with certain phenotypes, and the

results can be interpreted to answer research questions.
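
The rule that the final VCF keeps only positions where at least one individual carries a variant can be illustrated with a small Python sketch. It assumes genotypes are written as VCF GT strings (for example, "0/1" or "1|1", where 0 is the reference allele and "." is a missing call). The function names and the toy data are hypothetical.

```python
def has_alt_allele(genotype):
    """True if a GT string such as '0/1' or '1|1' carries a non-reference allele."""
    alleles = genotype.replace("|", "/").split("/")
    # "0" is the reference allele and "." is a missing call.
    return any(allele not in ("0", ".") for allele in alleles)


def variant_sites(sites):
    """Keep positions where at least one sample has a non-reference allele."""
    return [
        position
        for position, genotypes in sites
        if any(has_alt_allele(gt) for gt in genotypes)
    ]


# Made-up genotypes for three samples at three positions.
toy_sites = [
    (("chr1", 100), ["0/0", "0/1", "0/0"]),  # one heterozygous sample: kept
    (("chr1", 200), ["0/0", "0/0", "./."]),  # no alternate allele: dropped
    (("chr1", 300), ["1|1", "0|0", "0|1"]),  # two carriers: kept
]
print(variant_sites(toy_sites))
```

In practice a variant caller applies this kind of filtering itself; the sketch only makes the criterion explicit.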

Before digging into the steps of variant calling and analysis, it helps to

distinguish between the types of genetic variation studies. There are several types,

but generally they can be classified into (i) genome-wide association

studies (GWASs) [3], (ii) studies on the consequences of variants [4], and (iii) population

genetics [5].

GWASs involve genotyping a sample of individuals at common variants across the

genome using a genome-wide survey for variants. Studies of this kind are carried out to

identify variants and their associated phenotypes: variants causing a phenotype will be

at a higher frequency in affected individuals than in controls. The phenotype–genotype

associations must be supported by statistical evidence based on the population studied.
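
One simple form of such statistical evidence is a comparison of allele counts between affected individuals and controls at a single variant. The Python sketch below uses a Pearson chi-square test on a 2x2 table of allele counts (1 degree of freedom, no continuity correction). The choice of this test, the function name, and the counts are assumptions made for illustration; actual GWAS analyses typically rely on dedicated software and more elaborate models.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns the chi-square statistic and its p-value.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up allele counts: 100 cases and 100 controls, two alleles each.
chi2, p_value = allelic_chi_square(
    case_alt=60, case_ref=140, control_alt=40, control_ref=160
)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

Here the alternate allele frequency is 0.30 in cases and 0.20 in controls; the test asks whether a difference of that size is plausible by chance given the number of alleles sampled.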